# Image Caption Generation
Qwen2.5 VL 7B Instruct Gemlite Ao A8w8
Apache-2.0
This is a multimodal large language model quantized with A8W8, based on Qwen2.5-VL-7B-Instruct, supporting vision and language tasks.
Image-to-Text
Transformers

Q
mobiuslabsgmbh
161
1
Devstral Small Vision 2505 GGUF
Apache-2.0
A vision encoder based on the Mistral Small model; it supports image-to-text generation and is compatible with the llama.cpp framework
Image-to-Text
D
ngxson
777
20
Blip Gqa Ft
MIT
A fine-tuned vision-language model based on Salesforce/blip2-opt-2.7b for visual question answering tasks
Image-to-Text
Transformers

B
phucd
29
0
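A minimal visual question answering sketch following the upstream Salesforce/blip2-opt-2.7b usage; whether this fine-tune keeps the same prompt format and loading path is an assumption, and the image path is a placeholder.

```python
# VQA sketch in the style of the upstream BLIP-2 model card.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

model_id = "Salesforce/blip2-opt-2.7b"  # swap in the fine-tuned repo id if it loads the same way
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("kitchen.jpg").convert("RGB")
question = "Question: how many chairs are in the picture? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```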
Blip Custom Captioning
BSD-3-Clause
BLIP is a unified vision-language pretraining framework, excelling in vision-language tasks such as image caption generation
Image-to-Text
B
hiteshsatwani
78
0
Gemma 3 12b It Qat 3bit
Other
This is an MLX-format model converted from the Google Gemma 3-12B model, supporting image-text-to-text tasks.
Image-to-Text
Transformers Other

G
mlx-community
65
1
My Model
MIT
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.
Image-to-Text
PyTorch Supports Multiple Languages
M
anoushhka
87
0
Qwen2 VL 7B Captioner Relaxed GGUF
Apache-2.0
This model is a GGUF format conversion of Qwen2-VL-7B-Captioner-Relaxed, optimized for image-to-text tasks and runnable via tools such as llama.cpp and KoboldCpp.
Image-to-Text English
Q
r3b31
321
1
Llama Joycaption Alpha Two Hf Llava FP8 Dynamic
MIT
This is an FP8 compressed version of the Llama JoyCaption Alpha Two model developed by fancyfeast, implemented using the llm-compressor tool and compatible with the vllm framework.
Image-to-Text English
L
JKCHSTR
248
1
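A hedged sketch of vLLM's multimodal API as it would apply to an FP8 LLaVA-style checkpoint: the repo id and prompt template below are assumptions, and a real run should build the prompt from the model's own chat template.

```python
# Hedged vLLM multimodal sketch; repo id and prompt template are assumptions.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic",  # assumed repo id
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)
image = Image.open("photo.jpg").convert("RGB")
# Assumed LLaVA-style template; in practice derive it from the model's chat template.
prompt = "USER: <image>\nWrite a descriptive caption for this image. ASSISTANT:"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256, temperature=0.6),
)
print(outputs[0].outputs[0].text)
```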
Blip Image Captioning Large
BSD-3-Clause
A vision-language model pre-trained on the COCO dataset, excelling in generating accurate image descriptions
Image-to-Text
B
drgary
23
1
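For reference, the upstream Salesforce checkpoint that this listing appears to mirror is typically used as follows (the repo id and image path are assumptions).

```python
# Minimal BLIP captioning sketch using the upstream Salesforce checkpoint.
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("beach.jpg").convert("RGB")
# Unconditional captioning; pass text="a photography of" for conditional captioning.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```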
Florence 2 Base Castollux V0.4
An image caption generation model fine-tuned from microsoft/Florence-2-base, focused on improving description quality and formatting
Image-to-Text
Transformers English

F
PJMixers-Images
23
1
Molmo 7B D 0924 NF4
Apache-2.0
A 4-bit quantized version of Molmo-7B-D-0924 that reduces VRAM usage through the NF4 quantization strategy, suitable for environments with limited VRAM.
Image-to-Text
Transformers

M
Scoolar
1,259
1
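For context, NF4 is the 4-bit NormalFloat scheme from bitsandbytes. A sketch of how it is usually requested at load time is below; the listed repo appears to ship pre-quantized weights, so this on-the-fly config is illustrative only, and the upstream Molmo repo id is assumed.

```python
# Illustrative NF4 load-time quantization config (bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "allenai/Molmo-7B-D-0924"  # assumed upstream base; Molmo needs trust_remote_code
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, trust_remote_code=True, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```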
Llava Llama3
LLaVA-Llama3 is a multimodal model based on Llama-3, supporting joint processing of images and text.
Image-to-Text
L
chatpig
360
1
Qwen2 VL 7B Captioner Relaxed Q4 K M GGUF
Apache-2.0
This is a GGUF format model converted from the Qwen2-VL-7B-Captioner-Relaxed model, specifically designed for image-to-text tasks.
Image-to-Text English
Q
alecccdd
88
1
Vitucano 1b5 V1
Apache-2.0
ViTucano is a natively Portuguese pre-trained visual assistant that integrates visual understanding and language capabilities, suitable for multimodal tasks.
Image-to-Text
Transformers Other

V
TucanoBR
37
2
Microsoft Git Base
MIT
GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.
Image-to-Text Supports Multiple Languages
M
seckmaster
18
0
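A minimal GIT captioning sketch against the upstream microsoft/git-base weights; the repo id and image path are assumptions for this repackaged listing.

```python
# GIT image captioning sketch with the standard Transformers API.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model_id = "microsoft/git-base"  # or a fine-tuned variant such as microsoft/git-base-coco
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

image = Image.open("street.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```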
BLIP Radiology Model
BLIP is a Transformer-based image captioning model capable of generating natural language descriptions for input images.
Image-to-Text
Transformers

B
daliavanilla
16
0
Vit GPT2 Image Captioning
An image captioning model based on the ViT-GPT2 architecture, capable of generating natural language descriptions for input images.
Image-to-Text
Transformers

V
motheecreator
149
0
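The usual ViT-GPT2 captioning pattern looks roughly like this; the upstream nlpconnect checkpoint id is assumed here, since the listed repo appears to be a derivative of the same architecture.

```python
# ViT encoder + GPT-2 decoder captioning via VisionEncoderDecoderModel.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

model_id = "nlpconnect/vit-gpt2-image-captioning"  # assumed upstream checkpoint
model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("dog.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```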
Vit GPT2 Image Captioning Model
An image caption generation model based on the ViT-GPT2 architecture, capable of converting input images into descriptive text
Image-to-Text
Transformers

V
motheecreator
142
0
Moondream Caption
Apache-2.0
A customized small vision model based on Moondream2, fine-tuned specifically for image caption generation tasks
Image-to-Text
Transformers

M
wraps
108
9
Base ZhEn
This model converts image content into textual descriptions and is intended for non-commercial use.
Text Recognition
B
MixTex
50
0
Peacock
Other
The Peacock Model is an Arabic multimodal large language model based on the InstructBLIP architecture, with AraLLaMA as its language model.
Image-to-Text Arabic
P
UBC-NLP
73
1
Llama 3 EZO VLM 1
A Japanese vision-language model based on Llama-3-8B-Instruct, enhanced with additional pretraining and instruction tuning for improved Japanese capabilities
Image-to-Text Japanese
L
AXCXEPT
19
7
Florence 2 Large Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based paradigm to handle various vision and vision-language tasks.
Image-to-Text
Transformers

F
zhangfaen
14
0
Florence 2 SD3 Captioner
Apache-2.0
Florence-2-SD3-Captioner is an image caption generation model based on the Florence-2 architecture, specifically designed for generating high-quality image captions.
Image-to-Text
Transformers Supports Multiple Languages

F
gokaygokay
80.06k
34
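The standard Florence-2 prompting pattern is sketched below; the task token and detail prompt are assumptions carried over from the base model's conventions, and the image path is a placeholder.

```python
# Florence-2 style prompted captioning; task token and prompt are assumed.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "gokaygokay/Florence-2-SD3-Captioner"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("render.png").convert("RGB")
task = "<DESCRIPTION>"  # assumed; base Florence-2 uses tokens like <CAPTION>, <MORE_DETAILED_CAPTION>
inputs = processor(text=task + "Describe this image in great detail.", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=256, num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(parsed)
```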
Test Push
Apache-2.0
distilvit is an image-to-text model based on a ViT image encoder and a distilled GPT-2 text decoder, capable of generating textual descriptions of images.
Image-to-Text
Transformers

T
tarekziade
17
0
Florence 2 Base Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers

F
lodestones
14
0
Vit Base Patch16 224 Distilgpt2
Apache-2.0
DistilViT is an image caption generation model based on Vision Transformer (ViT) and distilled GPT-2, capable of converting images into textual descriptions.
Image-to-Text
Transformers

V
tarekziade
17
0
Convllava JP 1.3b 1280
ConvLLaVA-JP is a Japanese vision-language model that supports high-resolution input and can engage in conversations about input images.
Image-to-Text
Transformers Japanese

C
toshi456
31
1
Image Captioning Vit Gpt2 Flick8k
Apache-2.0
This model can convert input images into descriptive text, suitable for image understanding tasks in various scenarios.
Image-to-Text
Transformers

I
pltnhan311
18
0
Final Model
Apache-2.0
This image-to-text model, released under the Apache-2.0 license, converts image content into textual descriptions.
Text Recognition
Transformers

F
goatrider
17
0
Paligemma 3b Ft Scicap 448
PaliGemma is a multi-functional lightweight vision-language model that combines image and text inputs to generate text outputs and supports multiple languages.
Image-to-Text
Transformers

P
google
123
0
Paligemma 3b Ft Scicap 224
PaliGemma is a lightweight vision-language model that combines image and text inputs to generate text outputs, supporting multilingual and multi-task processing.
Image-to-Text
Transformers

P
google
107
0
Paligemma 3b Ft Ocrvqa 896
PaliGemma is a multi-functional lightweight vision-language model that supports image and text input and generates text output, suitable for various vision-language tasks.
Image-to-Text
Transformers

P
google
2,056
14
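A minimal PaliGemma inference sketch; the question wording and image path are assumptions, and the OCR-VQA fine-tune is aimed at questions about text visible in the image.

```python
# PaliGemma visual question answering with the standard Transformers classes.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image

model_id = "google/paligemma-3b-ft-ocrvqa-896"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("book_cover.jpg").convert("RGB")
prompt = "What is the title of this book?"  # assumed phrasing
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
# Decode only the generated answer tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```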
Blip Image Captioning Base Bf16
MIT
This model is a quantized version of Salesforce/blip-image-captioning-base, reducing floating-point precision to bfloat16, cutting memory usage by 50%, and is suitable for image-to-text generation tasks.
Image-to-Text
Transformers

B
gospacedev
20
1
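The bfloat16 conversion behind the claimed memory saving amounts to a dtype argument at load time, roughly as follows (halving parameter memory relative to float32; the upstream Salesforce repo id is assumed).

```python
# Loading the base BLIP captioning weights in bfloat16.
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base", torch_dtype=torch.bfloat16
)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
```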
Image Model
This is a Transformers-based image-to-text model; its specific capabilities require further documentation.
Image-to-Text
Transformers

I
Mouwiya
15
0
Heron Chat Git Ja Stablelm Base 7b V1
A vision-language model capable of conversing about input images, supporting Japanese interaction
Image-to-Text
Transformers Japanese

H
turing-motors
54
2
Uform Gen2 Dpo
Apache-2.0
UForm-Gen2-dpo is a small generative vision-language model, aligned for image caption generation and visual question answering tasks through Direct Preference Optimization (DPO) on VLFeedback and LLaVA-Human-Preference-10K preference datasets.
Image-to-Text
Transformers English

U
unum-cloud
3,568
44
Moondream Prompt
Apache-2.0
A fine-tuned version of Moondream2, optimized for image prompt generation. It is a lightweight vision-language model suitable for efficient operation on edge devices.
Image-to-Text
Transformers

M
gokaygokay
162
10
Distilvit
Apache-2.0
A vision-language model based on a ViT image encoder and a distilled GPT-2 text decoder, used for image caption generation tasks
Image-to-Text
Transformers

D
Mozilla
290
19
Git Base Minecraft
MIT
A GIT-based image-to-text model capable of generating image descriptions.
Image-to-Text
Transformers Supports Multiple Languages

G
orzhan
22
0